In this tutorial, we will be analyzing sample N1 from a dataset of tumor and immune cell populations from human early-stage lung adenocarcinomas (He et al., Oncogene 2021). Data has been obtained using 10X Genomics technology. The sample N1 is composed of 14,972 single cells that were sequenced on the Illumina Hiseq X platform. Raw data can be download from here. Prior to the analysis with popsicleR, raw data has been processed with Cell Ranger software to align reads and generate feature-barcode matrices. Cell Ranger data, comprising the gzipped TSV files of the feature-barcode matrix and of feature and barcode sequences, can be found here.
We start assigning a name to the sample (i.e., N1) and defining the path to the folder containing raw input data. popsicleR accepts as input either files from the Cell Ranger pipeline of 10X Genomics or a feature-barcode matrix of raw counts generated from any microfluidic-, microwell plates-, or droplet-based scRNA-seq technology. To organize input, it is convenient to create a directory named as the sample name and containing a sub-folder with the input data (either in the form of Cell Ranger gzipped TSV files or of a tab-delimited matrix). In this example, the Cell Ranger input files are stored in the cellranger_data folder. Some steps of the analysis that involve Seurat functions can be performed using parallelization following the instructions reported here.
library(popsicleR)
## Hello and welcome to popsicleR, an interactive Pipeline Of Preprocessing for SIngle CelL Experiment data.
## In this workflow, messages are colour coded:
## green messages will provide information on the ongoing step
## cyan messages will require the user to provide an input for advancing the script
## red messages will report missing information or wrong inputs
## grey messages report functions and software system outputs.
# set the working directory and sample name
sample.name <- "N1"
main.dir <- file.path("~", "PoPsicleExample", sample.name)
input.data.dir <- file.path(main.dir, "cellranger_data")
setwd(main.dir)
We start the analysis using the PrePlots function. PrePlots first reads the output of the cellranger pipeline from 10X through the Read10X() function of the Seurat package and returns a unique molecular identified (UMI) count matrix. The values in the count matrix represent the number of molecules for each feature (i.e., gene; row) that are detected in each cell (column). During the generation of the count matrix, a soft filter is applied to remove low quality cells. By default, PrePlots removes genes that are expressed in less than 10% of cells (percentage = 0.1) and cells that express less than 200 genes (gene_filter = 200). Additional parameters specify the organism (organism = "human") and if data has been generated using cellranger (cellranger = TRUE). If cellranger is set to FALSE, a matrix file can be provided through the input_matrix parameter. PrePlots returns violin, density, and scatter plots to easily explore QC metrics and set user-defined filtering thresholds. These plots report commonly used QC metrics as the number of unique genes detected in each cell (nFeature_RNA), the total number of molecules detected within a cell (nCount_RNA), the percentage of reads that map to the mitochondrial genome (percent_mt), and to ribosomal (percent_ribo) and dissociation genes (percent_disso). To help the user avoiding the removal of cells intrinsically characterized by outlying levels of unique genes, total UMIs, and mitochondrial fractions, PrePlots also returns the cell abundance and the distributions of these features within specific cell populations identified by a list of user-defined marker genes (e.g., SFTPC for alveolar cells and CD3D for T cells). All plots are saved in the 01.QC_Plots folder within the working directory.
# Define a list of cell-type markers
populations.markers <- c("EPCAM", "VIM", "COL1A1", "PECAM1", "PTPRC", "CD3D", "CD14", "SFTPC")
# Create a Seurat object from the raw data and visualize QC metrics
sample.umi <- PrePlots(sample = sample.name, input_data = input.data.dir, genelist = populations.markers,
percentage = 0.1, gene_filter = 200, cellranger = TRUE, input_matrix = NULL, organism = "human",
out_folder = main.dir)
##
## Plotting QC Violin plots
## Plots saved in: 01.QC_Plots\01a_QC_violin_plots.pdf
## Plotting QC Density plots
## Plots saved in: 01.QC_Plots\01b_QC_Hist_nGene_nUMI_MTf_Ribo.pdf
## Plotting QC Scatter plots
## Plots saved in: 01.QC_Plots\01c_QC_Scatter_nGene_nUMI_MTf.pdf
## Plotting QC per gene Histograms
## Plots saved in: 01.QC_Plots\01d_QC_Hist_Check.pdf
## Plotting QC per gene Scatter plots
## Plots saved in: 01.QC_Plots\01e_QC_Scatter_Check.pdf
##
## Now check the graphs, choose your thresholds and then run FilterPlots